fix(client): self-healing for permanently stuck expired shape handles #4087
KyleAMathews merged 11 commits into main
When stale cache retries exhaust (3 attempts), clear the expired entry from localStorage and retry once without the expired_handle param. Since handles are never reused (SPEC.md S0), the fresh response gets a new handle and bypasses stale detection. This prevents shapes from being permanently unloadable when a proxy strips cache-buster query params. Also documents the server handle uniqueness guarantee (S0) in the spec, updates the loop-back table for the new self-healing path, and resets the recovery guard on up-to-date so self-healing remains available for long-lived streams. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
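The flow described in this commit can be sketched as follows. This is a minimal, synchronous model rather than the client's actual code: `loadShape`, `Fetcher`, `RequestParams`, the `Map` standing in for the localStorage-backed cache, and the `retry-N` cache-buster values are all hypothetical names chosen for illustration. It also assumes one initial request followed by up to three cache-busted retries.

```typescript
// Simplified model of stale-retry plus self-healing. All names are hypothetical.
type ShapeResponse = { handle: string }
type RequestParams = { expiredHandle?: string; cacheBuster?: string }
type Fetcher = (params: RequestParams) => ShapeResponse

const MAX_STALE_RETRIES = 3

function loadShape(
  shapeKey: string,
  expiredHandles: Map<string, string>, // stands in for the localStorage cache
  fetchShape: Fetcher
): { response: ShapeResponse; selfHealed: boolean } {
  const expiredHandle = expiredHandles.get(shapeKey)
  let response = fetchShape(
    expiredHandle === undefined ? {} : { expiredHandle }
  )
  if (expiredHandle === undefined || response.handle !== expiredHandle) {
    return { response, selfHealed: false }
  }

  // The response carries the expired handle: treat it as stale and retry
  // with a cache-buster, up to MAX_STALE_RETRIES times.
  for (let attempt = 1; attempt <= MAX_STALE_RETRIES; attempt++) {
    response = fetchShape({ expiredHandle, cacheBuster: `retry-${attempt}` })
    if (response.handle !== expiredHandle) {
      return { response, selfHealed: false }
    }
  }

  // Retries exhausted: clear the expired entry and retry once without the
  // expired-handle param, accepting whatever comes back.
  expiredHandles.delete(shapeKey)
  return { response: fetchShape({}), selfHealed: true }
}
```

In this model the final request is accepted unconditionally because the expired entry no longer exists, which is what lets a stream recover when a proxy keeps answering with the same cached response.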
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #4087 +/- ##
=======================================
Coverage 88.78% 88.78%
=======================================
Files 25 25
Lines 2452 2471 +19
Branches 615 627 +12
=======================================
+ Hits 2177 2194 +17
- Misses 273 275 +2
Partials 2 2
…test The test waited for fast-loop detection to error, but the exponential backoff (100ms-5s across 5 detections) takes longer than the timeout in CI. Simplified to verify self-healing fires and the entry is cleared — the fast-loop error path is already tested in stream.test.ts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The #expiredShapeRecoveryKey guard was only cleared in #onMessages when an up-to-date batch arrived. The 204 backward-compatibility path transitions directly to LiveState without going through #onMessages (empty body → batch.length === 0 → early return), leaving the guard stuck. This prevented a second self-healing cycle on the same stream instance. Clear the guard in #onInitialResponse when the response transitions directly to live (action=accepted, state=live), covering the 204 path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
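A small sketch may help show why the guard has to be cleared on both paths. The class and function below are hypothetical stand-ins (`RecoveryGuard`, `StreamEvent`, and `maybeClearGuard` are not names from the client); they only model the "once per shape key until the stream is healthy again" behaviour described above.

```typescript
// Hypothetical model of the once-per-shape recovery guard.
class RecoveryGuard {
  private attemptedKey: string | null = null

  // Returns true if self-healing may run for this shape key, false if it
  // was already attempted and the guard has not been cleared since.
  tryAcquire(shapeKey: string): boolean {
    if (this.attemptedKey === shapeKey) return false
    this.attemptedKey = shapeKey
    return true
  }

  clear(): void {
    this.attemptedKey = null
  }
}

type StreamEvent =
  | { kind: 'messages'; upToDate: boolean } // non-empty batch via the message path
  | { kind: 'initial-response'; transitionedToLive: boolean } // e.g. an empty 204 body

// Clear the guard whenever the stream demonstrably reached a healthy state,
// whether that was observed through a message batch or directly on the
// initial response.
function maybeClearGuard(guard: RecoveryGuard, event: StreamEvent): void {
  const healthy =
    event.kind === 'messages' ? event.upToDate : event.transitionedToLive
  if (healthy) guard.clear()
}
```

If only the `messages` branch cleared the guard, a stream that went live through an empty response would leave `tryAcquire` returning `false` for the rest of its lifetime, which matches the stuck-guard bug this commit fixes.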
…test The caughtError===null assertion was environment-sensitive: the fast-loop detector's 500ms window can catch more requests on slower machines, firing a 502 that's orthogonal to the recovery guard bug being tested. The precise signal is selfHealCount===2: if the guard is stuck, the code throws 502 *before* incrementing, so selfHealCount stays at 1. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code Review

Overview

The core fix is correct. I traced the stale-retry → self-healing flow end-to-end and verified that body cancellation, guard clearing race-freedom, and reset ordering all hold.

Suggestions

1. The 204 guard-clearing check is looser than the comment suggests

    if (transition.action === `accepted` && this.#syncState.kind === `live`) {
      this.#expiredShapeRecoveryKey = null
    }

The comment says "e.g., 204" but the check fires whenever

Not blocking; defensive hardening.

2. The comment on

3. Self-healing silently accepts stale data

The "CDN always returns stale handle" case documents that after self-healing clears the expired entry, the client happily accepts the same stale response. The trade-off (stale data > permanent 502) is the right call, but there's no post-hoc signal to the user that they're on stale data. Consider logging a follow-up warning when a post-self-healing response arrives with the handle that was just marked-and-cleared. Low priority; could be a separate PR if telemetry becomes a need.

Test concerns

1. Timing-sensitive

The test history (

2. The 204-path test is doing too much

105 lines of mock logic with multi-phase state (

3. Mutating

Recommendation

Approve with minor suggestions. Fix #1 (tighten the 204 check) and #2 (comment update) are cheap safety wins worth doing before merge. The rest are non-blocking.
Addresses three findings from external review of the self-healing PR: 1. Detect and warn when a post-self-heal response carries the same handle we just marked expired. Previously the client silently accepted stale data with no operator signal — now it emits a targeted warning naming the handle and pointing at the proxy cache-key misconfiguration that causes this. 2. Tighten the recovery-guard clearing check from `accepted + kind === live` to an explicit `status === 204`, matching the comment's intent and removing latent fragility if the state machine ever starts transitioning to live for non-204 responses. 3. Update the `#reset()` comment to list all three callers (#requestShape's 409 handler, #checkFastLoop, and stale-retry self-healing) instead of only the 409 handler. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
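The first finding can be illustrated with a short sketch. The helper name `warnIfStillStale`, its parameters, and the warning wording are assumptions made for this example and are not taken from the client source; the only behaviour modelled is "warn when the handle that was just cleared comes back".

```typescript
// Hypothetical helper: emit a targeted warning when the response that follows
// a self-heal still carries the handle that was just cleared.
function warnIfStillStale(
  clearedHandle: string | null, // handle removed from the expired cache, if any
  responseHandle: string,
  warn: (message: string) => void
): boolean {
  if (clearedHandle === null || responseHandle !== clearedHandle) {
    return false
  }
  warn(
    `Shape handle "${clearedHandle}" was returned again after self-healing. ` +
      `The data may be stale; check whether a proxy or CDN ignores query ` +
      `parameters in its cache key.`
  )
  return true
}
```

In real use the `warn` argument would typically be `console.warn`; taking it as a parameter keeps the sketch easy to assert on in tests.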
Two orthogonal CI fixes for the expired-shapes-cache test suite:
1. Silence stderr noise: many tests in this suite intentionally trigger
stale-cache scenarios that produce expected console.warn output.
Install a shared beforeEach spy that mocks console.warn for all
tests (tests that need to assert on warnings still can, via
warnSpy.mock.calls).
2. Prevent unhandled FetchError(502) from fast-loop detector:
- "CDN always returns stale handle" test: add onError handler and
poll until self-heal fires, then abort explicitly so the stream
can't loop in the background until #checkFastLoop throws.
- "204 recovery guard" test: add onError handler so the fast-loop
detector (which the test's own comment acknowledges may race
on slow runners) can't leak as an unhandled rejection.
Both tests still assert the same signals — this change affects test
plumbing only.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Package-local prettier 3.3.3 and root prettier 3.6.2 format markdown blank-lines-before-lists differently. CI runs root prettier (strips the line); the pre-commit hook picks up package-local prettier (re-adds it), causing lint-staged to think the commit is empty when trying to fix CI. Changesets auto-generates CHANGELOG files, so formatting them manually isn't needed anyway. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Package-local prettier 3.3.3 formats markdown blank-lines-before-lists differently than root prettier 3.6.2. CI runs root prettier; the pre-commit hook picks up package-local prettier — they disagreed, causing lint-staged to think every changelog fix was an empty commit. - Bumped prettier to ^3.6.2 in experimental, react-hooks, start, typescript-client, y-electric, burn/assets, and redis packages (aligning with root 3.6.2; identified via sherif). - Reverted the temporary CHANGELOG.md entry in .prettierignore now that the root cause is fixed. - Reformatted both package CHANGELOG files with 3.6.2 to match CI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Using bare "prettier --write" in lint-staged lets nano-spawn's PATH resolution pick up globally-installed prettier (e.g. ~/.bun/bin/prettier 3.4.2) over the workspace's 3.6.2. That global version disagrees with CI on markdown list formatting, causing the pre-commit hook to silently revert the fix and then fail with "empty commit". Pointing explicitly at node_modules/.bin/prettier guarantees lint-staged uses exactly the version pnpm installed. Also reformats the two CHANGELOGs that CI's format:check flagged.
This PR has been released! 🚀 The following packages include changes from this PR:
Thanks for contributing to Electric!
Summary
Expired shape handle entries in localStorage can get permanently stuck, preventing data from ever loading for affected shapes. This adds a self-healing retry mechanism that clears the poisoned entry and retries once, allowing automatic recovery even when a proxy strips cache-buster query parameters.
Based on #4085 by @evan-liveflow — refined with additional hardening from code review.
Root Cause
When a shape gets a 409 (handle rotation), the client stores the old handle in `localStorage['electric_expired_shapes']`. On future requests, if a response contains that handle, the client treats it as a stale cached response and retries up to 3 times with cache-buster params.

The problem: if a proxy (e.g., phoenix_sync) strips query parameters, the cache busters are ineffective. All 3 retries fail, `FetchError(502)` is thrown to `onError`, and if `onError` doesn't retry, the stream dies. The expired entry persists in localStorage, so the next session hits the same wall, permanently.

Since the server never reuses handles (now documented as SPEC.md S0), the expired entry becomes a false positive once the caching layer clears, but the client has no way to discover this.
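To make the failure mode concrete, here is a rough model of a proxy that forwards only the query parameters it knows about. The function names, the `cache_buster` parameter name, and the example URL are hypothetical and used only for illustration.

```typescript
// Hypothetical: append a cache-buster param to a request URL.
function withCacheBuster(requestUrl: string, token: string): string {
  const separator = requestUrl.indexOf('?') === -1 ? '?' : '&'
  return `${requestUrl}${separator}cache_buster=${encodeURIComponent(token)}`
}

// Hypothetical proxy model: forward only allow-listed query params, dropping
// everything else (including any cache-buster).
function forwardThroughProxy(
  requestUrl: string,
  allowedParams: string[]
): string {
  const queryStart = requestUrl.indexOf('?')
  if (queryStart === -1) return requestUrl
  const path = requestUrl.slice(0, queryStart)
  const kept = requestUrl
    .slice(queryStart + 1)
    .split('&')
    .filter((pair) =>
      allowedParams.some(
        (name) => pair === name || pair.indexOf(`${name}=`) === 0
      )
    )
  return kept.length > 0 ? `${path}?${kept.join('&')}` : path
}
```

Two retries with different cache-buster tokens produce distinct URLs on the client side, but after passing through such a proxy they collapse to the same upstream URL, so a cache behind it keeps serving the same stale response.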
Approach
After stale cache retries exhaust (3 attempts), the client now:
`expired_handle` param. Since handles are never reused, the fresh response will have a new handle and won't trigger stale detection
`#expiredShapeRecoveryKey` (once per shape key, reset on up-to-date)

Key Invariants
`#expiredShapeRecoveryKey` guard)

Non-goals
`onError` contract: the fix works regardless of what the user's `onError` callback does

Verification
Files changed
src/client.ts: `#onInitialResponse`, recovery key cleared on up-to-date, updated catch block comment
test/expired-shapes-cache.test.ts
SPEC.md
.changeset/fix-expired-shapes-self-healing.md